Quality Aware Network for Set to Set Recognition
This paper targets the problem of set-to-set recognition, which learns the
metric between two image sets in which the images of each set belong to the
same identity. Since images in a set can be complementary, they can lead to
higher accuracy in practical applications. However, the quality of each sample
cannot be guaranteed, and samples of poor quality hurt the metric. In this
paper, the quality aware network (QAN) is proposed to address this problem,
where the quality of each sample can be automatically learned although such
information is not explicitly provided in the training stage. The network has
two branches: the first extracts an appearance feature embedding for each
sample, and the other predicts a quality score for each sample.
Features and quality scores of all samples in a set are then aggregated to
generate the final feature embedding. We show that the two branches can be
trained in an end-to-end manner given only the set-level identity annotation.
An analysis of the gradient flow through this mechanism indicates that the
quality learned by the network benefits set-to-set recognition and simplifies
the distribution that the network needs to fit. Experiments on both face
verification and person re-identification show the advantages of the proposed QAN.
The source code and network structure can be downloaded at
https://github.com/sciencefans/Quality-Aware-Network.

Comment: Accepted at CVPR 201
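The abstract does not spell out the aggregation rule, but a natural reading of "features and quality scores of all samples in a set are then aggregated" is a quality-weighted pooling of the per-sample embeddings. A minimal sketch of that idea, assuming softmax-normalized quality weights (the function name and the softmax choice are illustrative, not taken from the paper):

```python
import numpy as np

def quality_aware_pooling(features, quality_logits):
    """Aggregate per-sample features into one set-level embedding,
    weighted by softmax-normalized quality scores."""
    # Softmax over the set so the quality weights sum to 1.
    logits = quality_logits - np.max(quality_logits)  # numerical stability
    weights = np.exp(logits) / np.sum(np.exp(logits))
    # Weighted sum of the per-sample embeddings: (n, d) -> (d,)
    return weights @ features

# Toy set: 3 samples with 4-dimensional embeddings.
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
scores = np.array([2.0, 0.0, -2.0])  # first sample judged highest quality
set_embedding = quality_aware_pooling(feats, scores)
```

Because the weights sum to one, a single high-quality sample can dominate the set embedding while low-quality samples are suppressed, which matches the stated goal of keeping poor samples from hurting the metric.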
Data Reduction Methods of Audio Signals for Embedded Sound Event Recognition
Sound event detection is a typical Internet of Things (IoT) application task: it can serve in scenarios such as dedicated security applications, where cameras may be unsuitable due to environmental variations in lighting and movement. In realistic deployments, models for this task are usually implemented on embedded devices with microphones. The idea of edge computing is to process data near where it is produced, because reacting in real time is important in many applications: transmitting collected audio clips to the cloud can introduce large delays and sometimes serious consequences. Processing locally, however, raises another problem: the heavy computation can exceed the capabilities of embedded devices. Recent work has made substantial progress on this problem through model compression and hardware acceleration.
This thesis provides a new perspective on embedded deep learning for audio tasks, aiming to reduce the data amount of audio signals for the sound event recognition task. Instead of compressing the model or designing a hardware accelerator, our methods focus on the analog front-end signal acquisition side, reducing the data amount of audio clips directly through specific sampling methods. State-of-the-art approaches to sound event detection are mainly based on deep learning models, for which a smaller input means lower latency: fewer time steps for a recurrent neural network (RNN) or fewer convolution operations for a convolutional neural network (CNN). Reducing the input data amount therefore reduces the computation and parameter count of the neural network classifier, and naturally the delay at inference time. Our experiments implement three data reduction methods for this sound event detection task, all based on reducing the number of sample points of an audio signal: using a lower sampling rate and sampling width, using a sigma-delta analog-to-digital converter (ADC), and using a level-crossing (LC) ADC. We simulate these three kinds of signals and feed them into the neural network to train the classifier.
Finally, we conclude that traditionally sampled audio signals still contain redundancy for audio classification, and that specific ADC modules achieve better classification performance than conventional sampling at the same data amount.
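Of the three reduction methods, level-crossing sampling is the least conventional: instead of sampling at fixed time intervals, the ADC emits an event only when the signal crosses a quantization level. A minimal simulation sketch under assumed behavior (the quantization-to-level-grid detail is an assumption, not taken from the thesis):

```python
import numpy as np

def level_crossing_sample(signal, delta):
    """Simulate a level-crossing (LC) ADC: emit an (index, level) event only
    when the signal has moved by at least `delta` from the last emitted level."""
    last = float(signal[0])
    events = [(0, last)]
    for i in range(1, len(signal)):
        x = float(signal[i])
        if abs(x - last) >= delta:
            # Snap the emitted value to the nearest crossed level.
            steps = np.floor(abs(x - last) / delta)
            last = last + np.sign(x - last) * steps * delta
            events.append((i, last))
    return events

# A 1 kHz tone uniformly sampled at 16 kHz for 10 ms (160 samples);
# with coarse levels the LC ADC emits noticeably fewer events than samples.
t = np.arange(0, 0.01, 1 / 16000)
tone = np.sin(2 * np.pi * 1000 * t)
events = level_crossing_sample(tone, delta=0.5)
```

This illustrates the thesis's central trade-off: the event stream carries less data than the uniformly sampled clip, and the classifier is then trained on the reduced representation.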
End-to-end Flow Correlation Tracking with Spatial-temporal Attention
Discriminative correlation filters (DCF) with deep convolutional features
have achieved favorable performance in recent tracking benchmarks. However,
most existing DCF trackers consider only appearance features of the current
frame and hardly benefit from motion and inter-frame information. The lack of
temporal information degrades the tracking performance during challenges such
as partial occlusion and deformation. In this work, we focus on making use of
the rich flow information in consecutive frames to improve the feature
representation and the tracking accuracy. Firstly, individual components,
including optical flow estimation, feature extraction, aggregation and
correlation filter tracking, are formulated as special layers in a network. To the
best of our knowledge, this is the first work to jointly train the flow and
tracking tasks in a deep learning framework. Then the historical feature maps
at predefined intervals are warped and aggregated with the current ones under
the guidance of flow. For adaptive aggregation, we propose a novel spatial-temporal
attention mechanism. Extensive experiments are performed on four challenging
tracking datasets: OTB2013, OTB2015, VOT2015 and VOT2016, and the proposed
method achieves superior results on these benchmarks.

Comment: Accepted in CVPR 201
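The adaptive aggregation step above can be sketched as follows. This is an illustrative approximation, not the paper's exact attention layer: it assumes the historical maps are already flow-warped to the current frame, and uses per-pixel cosine similarity to the current frame followed by a softmax over time (both the similarity measure and the normalization are assumptions):

```python
import numpy as np

def spatial_temporal_aggregate(warped_feats, current_feat):
    """Fuse flow-warped historical feature maps with the current one using
    per-pixel attention weights derived from similarity to the current frame."""
    # warped_feats: (T, C, H, W) historical maps warped to the current frame
    # current_feat: (C, H, W)
    stack = np.concatenate([warped_feats, current_feat[None]], axis=0)  # (T+1, C, H, W)
    # Cosine similarity between each map and the current frame, per pixel.
    norm = np.linalg.norm(stack, axis=1, keepdims=True) + 1e-8          # (T+1, 1, H, W)
    cur_norm = np.linalg.norm(current_feat, axis=0, keepdims=True) + 1e-8  # (1, H, W)
    sim = (stack * current_feat[None]).sum(axis=1, keepdims=True) / (norm * cur_norm)
    # Softmax over time at every spatial location -> attention weights.
    w = np.exp(sim - sim.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)
    # Attention-weighted sum over time: (T+1, C, H, W) -> (C, H, W)
    return (w * stack).sum(axis=0)

# Sanity check: if every warped map equals the current one, the weights are
# uniform and the aggregation returns the current feature map unchanged.
cur = np.ones((2, 4, 4))
fused = spatial_temporal_aggregate(np.stack([cur, cur]), cur)
```

The softmax over the temporal axis at each pixel is what makes the aggregation "spatial-temporal": every location independently decides how much each historical frame contributes.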